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Abstract 

Markov Chain Monte Carlo methods are widely used in signal processing and communications 
for statistical inference and stochastic optimization. In this work, we introduce an efficient adaptive 
MetropoUs-Hastings algorithm to draw samples from generic multi-modal and multi-dimensional target 
distributions. The proposal density is a mixture of Gaussian densities with all parameters (weights, mean 
vectors and covariance matrices) updated using all the previously generated samples applying simple 
recursive rules. Numerical results for the one and two-dimensional cases are provided to show the proper 
convergence of the proposal to the tai^get. 

Index Terms 



X 



Markov Chain Monte Carlo (MCMC), Gaussian mixtures, adaptive Metropolis-Hastings algorithms 

I. Introduction 

Markov Chain Monte Carlo (MCMC) methods | |Liu[ |2004[ [Liang et aT| |2010[ are ubiquitously used 
for performing inference and solving optimization problems in many scientific fields: statistics, digital 
communications, machine learning, signal processing, etc. [Robert and Casella 2004[ Wang et alj 2002 



Andrieu et al. 2003 Fitzgerald 2001 1. MCMC approaches are able to generate samples virtually from 
any target distribution (known up to a normalizing constant) by using a simpler proposal distribution. 
The basic underlying idea of standard MCMC techniques is producing a Markov chain that converges to 
the target. 
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The most famous MCMC technique is the MetropoUs-Hastings (MH) algorithm | Metropolis et al. 



1953[ Hastings! 1970 Robert and Casella) 2004[ . However, the main drawback of the MH method (and 



in general of all MCMC methods) is that the correlation among the samples in the Markov chain can 
be very high when the acceptance rate is low |Liu 2004 Liang et al.[ 2010 Martino and Mfguez[ 
2010[ . Correlated samples provide less statistical information and the resulting chain can remain trapped 



almost indefinitely in a local mode, meaning that convergence can be extremely slow. Therefore, since 
the correlation depends on the discrepancy between the target and proposal distributions, we would like 
it to be as close to the target as possible. 

Several extensions have been proposed in the literature to speed up the convergence and reduce the 
so called "burn-in" period. Among them, adaptive MH methods (i.e., MH algorithms with adaptive 
proposal distributions) are particularly interesting [ Liang et al.[ 2010[ Martino et al. 2012 1. Indeed, 
MCMC techniques usually need the selection of several parameters by the user before they can be 
applied to any particular problem. The use of adaptive proposals overcomes this issue, providing black- 
box algorithms with self-tuning capabilities. An adaptive MH technique improves the proposal distribution 
by learning at least some of its parameters from all the previously generated samples. Unfortunately, an 
important problem with the adaptation of the proposal is that the Markov property is lost and the invariant 
distribution of the chain could be disturbed. Hence, adaptive MH algorithms must be carefully designed 
to avoid this issue. 

An adaptive Metropolis that uses an adaptive random walk Gaussian proposal, was introduced in 



HHaario et aLj |200T| (we will denote it as AM method). The covariance matrix of the proposal is updated 
using recursive empirical estimators applied to the samples generated by the chain. The AM algorithm 
is an example of a partially adaptive MH approach, since it only updates the covariance of the proposal, 
whereas the mean of the Gaussian jumps to the current state of the chain at each iteration. An attempt 
of extending the AM algorithm by using a mixture of Gaussians as a proposal and updating all of its 



parameters (thus obtaining a fully adaptive MH algorithm) can be found in |Giordani and Kohn 2010 1. 
However, the resulting algorithm is quite complicated, and the proposal is only updated at some iterations. 

In this work, we introduce an independent MH technique where the proposal PDF is an adaptive 
mixture of Gaussians. All the parameters (weights, means and covariance matrices) of the Gaussians in 
the mixture are updated using empirical estimators with simple recursive formulas (i.e., our method is 
fully adaptive). After a training period, the proposal is adapted at every iteration. The resulting AGM-MH 
algorithm can be used to draw samples from arbitrary multi-modal and multi-dimensional targets, always 
improving the performance w.r.t. a non-adaptive MH scheme using the initial proposal. 
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The rest of the paper is organized as follows. Section |Tl] shows the problem formulation. Section III 



presents the proposed AGM-MH algorithm. Sections IV and |V] describe efficient parameter update rules 



and black-box usage. Finally, Sections VI and VII show the results and conclude the paper. 



II. Problem Formulation 
Let us assume that we need to draw samples from a (possibly multi-modal) generic d-dimensional 



target probability density function (PDF), Po(x), with support V C M"^. The AM algorithm [Haario et al. 



200 1[ uses an adaptive random walk MH with a Gaussian proposal, mean equal to the previous state of 



the chain (/i = xt_i), and covariance matrix, C, estimated from all previous states, i.e., g(x|xt_i, C) oc 
7\A(x|xt-i, C), where 



A/'(x|;x, C) oc exp f-^(x - /i)'^C ^(x-/i) 



(1) 



denotes a standard multi-variate Gaussian PDF. In order to improve the performance of the AM algorithm, 
here we consider a mixture of N Gaussians as the proposal PDF, i.e.. 



N 

q{x\w, /Xi:Ar, Ci:Ar) = ^ t(7jAA(x|/Xj, Q), (2) 

4=1 

where AA(x|//j, Cj) is given by ([T]), /ij = ^Uj^rf]^ is the dxl mean vector, Cj is the dxd positive 

definite covariance matrix, and w = [wi, li^Ar]"^ are the normalized weights (i.e., wi + ... + wn = 1)- 
Moreover, we define fii-.N = [fJ-i, IJ-n] and Ci:N = [Ci, Cjv]. We note that our approach is a 
fully adaptive MH algorithm, since (unlike the AM algorithm, which only updates the covariance) all the 
parameters in the mixture are learnt from all the previously generated samples. The resulting algorithm 
is very simple, since the adaptation is based on empirical estimators that can be implemented efficiently 
using recursive formulas. 

Since the adaptation could disturb the convergence of the generated chain to the target PDF, we consider 
the possibility of stopping it at an iteration Tstop- Hence, for t > Tgtop our algorithm is a standard MH with 
an improved proposal PDF w.r.t. the initial choice, thus providing a better performance and guaranteeing 



convergence. However, the numerical results described in Section VI show that the algorithm seems to 
maintain the correct ergodicity properties, so we always use Tstop = Ttot- A theoretical convergence proof 
is under development and will be included in future works. Finally, we note that degeneracy problems can 
appear during the first iterations in the update of the covariance matrices if we have a poor initialization. 
In order to avoid this issue, we allow the method to use a few iterations (t = 1, . . . ,Ttrain) to collect 
information about the target, assigning the produced state of the chain to the closest Gaussian in the 



mixture, as in [Haario et al. 2001 1. 
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III. AGM-MH Algorithm 

The proposed AGM-MH algorithm is described below. First of all, notice that, during the first Ttrain 
time steps the algorithm just assigns the current state of the chain to a Gaussian among the N in the 
mixture, according to the minimum Euclidean distance between x^ and the means fif ^\ i = 1,...,N. 
Afterwards, the algorithm updates all the parameters in the mixture until t = Tstop, when adaptation is 
stopped. In the description of the algorithm, the parameters are updated using a block procedure, but 
efficient recursive update formulas can be obtained, as shown in Section |IV] 

1) Initialization: 

a) Time instants: Set t = 0. Choose also the values of Ttrain < Ttot and Ttrain < T^top < Ttot- Let 
be Ttot the number of total iteration of the chain. 

b) Proposal: Choose the number of Gaussians N, as well the initial settings for = 
and cS°i = [cS°),...,c(^)].Set = ^1. 

c) Auxiliary parameters: Define S^^ = [s^P = /^P''], and = 1 represents the number of columns 
of S^'''* , with i = 1 , . . . , A^. Let e a small constant value and Id a identity matrix. 

2) MH steps: 

a) Sample x' from a mixture of Gaussian PDFs, 

x'~g,(xV(*\M-!v.cl:iv). 

b) Accept xt_|_i = x' with probability 



a = mm 



, 'Kxt)«(x'|w(*),Ax(*)^,c(l) 



(3) 



otherwise set x^+i = x^. 
3) If t < Tstop, update parameters of the proposal: 

a) Find the closest Gaussian to x^+i (w.r.t. Euclidean distance), i.e., find the index 

j = argmin \nl' - y^t+i\ ■ (4) 

i 

b) Set rrij = mj + 1 and update (adding a new column) the j-th auxiliary matrix 

s(*+^) = [sf,s(.'"^)=xi+i], (5) 

whereas S-*^^^ = S-*'', for all i ^ j. 

c) If t > Ttrain- Update the parameters of j-th Gaussian, 



m 



i=l 



then denoting as = S?+^) - /x^^^^^ 



e(*+i) ra(*+i)l'^ 



JTlj - 1 



(7) 



and set /x-*^^'' = C^-*^^^ = cf \ Vi / j. Since mj has been incremented, then update also 



the weights 



(8) 



sothat w(*+i) = 
4) If t < Ttot repeat from step 2. 
Observe that the proposal PDF is only updated for Ttrain < t < Tgtop- Moreover, note that the matrix S • 
(and S •*■'), for some i G {1, N}, has dimension dx rrii, so that C-*'' has always dimension dx d. The 
identity matrix eld is used just to avoid numerical problems (the matrix C^^*^ must be positive definite), 
as in [Haario et aLj|200T |. 



IV. Efficient Recursive Update of the Parameters 

To update the parameters of the selected (j-th) Gaussian PDF in the mixture, we can use recursive 
expressions. Indeed, Recalling that rrij = rrij + 1 is already updated and sj*"^^ = xj+i in step 
algorithm, ([6]) can be rewritten as 



3b 



of the 



and the Eq. ^ becomes 



Qit+l) _ 



nij - 1 



Finally, note that 



— — Xt+i H /x - , 

J m ■ m ■ 



m,- 



TV 



rrij — 1 



for Ttrain < t < Tstop, SO that 



k=l 



w} ' = —, i = l,...,N, 



(9) 



(10) 



(11) 



In this way, the novel technique becomes computationally efficient in high dimensional problems, as 
well. 
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V. AGM-MH AS BLACK-BOX METHOD 



The AGM-MH method shows sensitive dependence on the initial conditions. If some prior information 
about the target is available, it can be used to choose the initial parameters. If no prior informations is 
available, the AGM-MH can be applied as black-box algorithm in the following way: 

• Use a great number N of Gaussians (typically the number must be greater if the dimension d 



• Select randomly the means /x^.]^ in order to cover as possible the support domain V C M'^. 

• Choose diagonal covariance matrices cf'lf = cr^I^, with a big value of cr^ in order to explore the 
space P C M*^ in the train period, t < Ttrain- 

• The parameter Ttrain should be chosen bigger for greater dimensions d. Numerical results suggest 
that Ttrain = 100(i could be a suitable choice. In general, for more complicated target distributions 
a greater Ttrain could be needed. 

The use of a huge number of Gaussians does not generate computational problems since in the adaptation 
of the AGM-MH the weights of the irrelevant vanish quickly to zero. Therefore, the computational cost is 
controlled by the adaptation so that only the Gaussians located close to high probability regions survive. 
The useless Gaussian PDFs located faraway to the modes of the target are virtually discarded (the weights 
becomes quickly zero). Finally it is important to remark that, in general, the proposal is refined from the 
initial setting and the performance is improved. 



In this toy example, we apply the AGM-MH method to draw from a univariate bimodal target PDF 



that, clearly, has two modes at x = ±2. We set N = 2, number of Gaussian PDFs in the proposal PDF, 
= 0.5, and (al)^ = 10 with i = 1,2. The two initial means /if^ - U{[-4,0]), /xf^ ~ ^([0,4]) 
are chosen uniformly in [—4,0] and [0,4], respectively. We draw Ttot = 5000 iterations of the chain, 
and set Ttrain = 200, Tgtop = Ttot (i-c, the adaptation is never stopped). The initial state is randomly 
choose as xo ~ Ar(a;; 0, 1). We use all the generated samples to estimate the mean of the target (that is 



increases). 



VI. Simulations 



A. Example 1 



defined as 




(12) 
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since p{x) is symmetric). The mean square eiTor (MSE) of the estimation (averaged over 2000 runs) 
is 15 • 10^^. The estimated Unear correlation between contiguous samples is ^ 0.18. Without using 
any adaptation, namely using a standard MH, the correlation is 0.78. Hence, we can observe as the 
correlation decreases using the AGM-MH algorithm. 

(T ) (T ) 

The final averaged locations of the means of the proposal are -1.88, ^i'^ ° ~ The 



final weights of the mixture are w. 



0.5 and af ^ 0.16, i = 1,2. 
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T 
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1000 2000 3000 4000 5000 



(a) 





(b) 



(c) 



Fig. 1. (a) The averaged values of a as function of the iteration index t for different M = 2,3,6. For t > Ttrain, the a 
grows since the proposal is closer to the target, (b) The initial configuration of the means (squares) and the covariance matrices 
(ellipses), (c) The final configuration at t = Ttot = 7000. 
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B. Example 2 

For the sake of simplicity, we consider again an univariate target density. However, now we consider 
that target distribution is itself a mixture of Gaussian PDFs|^ Specifically, the target PDF is formed by 
M-Gaussians, i.e., 



where the weights are Cj = 1/M and the variances pf = A, i = 1, ...,M. We consider different 3 cases 
with M = 2, 3, 6. The means are, for each case, 

rji = -10, 10, for M = 2, 

7?i = -10,0,10, for M = 3, 

r;^ = -15, -10, -5,5, 10, 15 for M = 6, 

with i = 1,...,M. In the proposal we also use N = M Gaussians and each initial mean is chosen 
uniformly in [—20,20]. All the initial variances are set {a'^)^^^ = 10 and the weig hts wf^ = l/N, 



As in the first example, we draw Ttot = 5000 iterations of the chain, set Ttrain = 200 and Tgtop = Ttot 
(i.e., the adaptation is never stopped). The initial state of the chain is randomly choose as xq ~ N{x; 0, 1). 
We use all the generated samples to estimate the normahzing constant of the target. The mean square error 
(MSE) of the estimation (averaged over 1000 runs) is 1.6 • 10"^, 1.1 • 10"^ and 2 • 10"^ for M = 2, 3, 6 
respectively. The resulting correlation is 0.13, 0.14 and 0.16 for M = 2, 3, 6, respectively. With a standard 
MH (without adaptation) the correlation is 0.81, 0.72 and 0.46, for M = 2,3,6. Hence, another time, 
we remark as the adaptation of the AGM-MH reduces considerably the correlation among the generated 
samples. Finally, Fig. [TJa) depicts the averaged values of the acceptance function a in Eq. Q as function 
of t and for different M. In this case, for t > Ttrain, the averaged values of a increase because of the 
proposal becomes closer to the target PDF owing to the adaptation. 



M 




(13) 



j = l,...,N. 



'it is important to remark tliat we consider a mixture of Gaussians as a target PDF just to discuss the performance of the 
AGM-MH algorithm. More specifically, we desire to show as, in this case, the proposal convergences to real shape of the target, 
depending on the initial setting. However, clearly, the algorithm can be used to draw from any kind of target distribution. 
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C. Example 3 

In this example, our goal is drawing from a bivariate target PDF using the AGM-MH as a black- 
box technique. Just for simplicity, we also consider a bivariate mixture of M = 2 Gaussians as target 
distribution, with means r]i = [—2, —2]^, r}2 = [0, 4]^ and covariance matrices Xli = [0.3 0.1; 0.1 0.3], 
S2 = [0.8 -0.3; -0.3 0.8]. The weights are = 0.5, 0.5, z = 1, 2. We set Ttot = 7000, set Ttrain = 200 
and Tstop = Ttot (i-c, the adaptation is never stopped). First, we use for the proposal N = 2 Gaussian 
PDFs, ^ = 0.5, Cf ^ = lOId for i = l,2. The means are selected uniformly, m ~ U{[-5, 5] x [0, 5]) 
and fi2 ~ U{[—5, 5] x [—5, 0]). In this case, all the parameters of the mixture in the proposal convergences 
alwayas to the corresponding values of the mixture forming the target PDF. 

Moreover, we also consider the case with N = 10. We choose the means /ij of the proposal uniformly 
in the square [—5, 5] x [—5, 5]. In this situation, the AGM-MH refines the initial proposal PDF improving 
the parameters of the Gaussians in the mixture that are in good locations, whereas the weights of the 
unhelpful Gaussians decrease quickly to zero and their parameters remain invariant, as shown in Fig. [T] 

VII. Discussion 

We have proposed a novel adaptive independent MH algorithm (AGM-MH) to draw samples from 



arbitrary multi-modal and multi-dimensional targets. AGM-MH builds on the work of [Haario et al. 



2001[ , extending it by using a Gaussian mixture proposal and updating also the means and the weights 



of the Gaussians. Compared to a previous extension provided by IGiordani and Kohn 20101, our approach 



is more efficient, updates the proposal at every iteration instead of only at a fixed number of iterations. 
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